feat: Enable CLI support for content variants and custom extensions #46
base: main
Conversation
📝 Walkthrough
Adds support for pre-parsed distribution dictionaries to the deploy API, updates the CLI to parse pipe-separated distribution strings into dicts, and introduces a SPARQL query constant with a helper to parse aggregated content-variant strings.
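For illustration, a minimal sketch of the dictionary shape the new parser is expected to produce (key names are taken from the review comments below; the URL and modifiers are just an example):

from databusclient.cli import parse_distribution_str

parsed = parse_distribution_str("http://example.com/data|lang=en|.ttl|.gz")
# Expected shape (illustrative):
# {
#     "url": "http://example.com/data",
#     "variants": {"lang": "en"},
#     "formatExtension": "ttl",
#     "compression": "gz",
# }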
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/test_deploy.py (1)
19-47: Test coverage is good; consider adding a formatExtension assertion in the last test.
The tests comprehensively cover multiple scenarios. However, the final test case (lines 44-46) validates only compression but doesn't verify that formatExtension is None as expected.
🔎 Suggested test completeness improvement
 # Test with compression only (no format extension)
 result = parse_distribution_str("http://example.com/data|.gz")
 assert result["url"] == "http://example.com/data"
 assert result["compression"] == "gz"
+assert result["formatExtension"] is None
+assert result["variants"] == {}
databusclient/cli.py (1)
15-57: Fix indentation and consider extracting the compression list.
Line 50: The indentation of the warning print statement is off; re-indent it to match the surrounding block (see the diff below).
Line 37: The hardcoded compression extensions list is a pragmatic heuristic, but extracting it as a module-level constant would improve maintainability:
🔎 Suggested improvements
+# Common compression file extensions
+COMPRESSION_EXTENSIONS = {'.gz', '.zip', '.br', '.tar', '.zst'}
+
 def parse_distribution_str(dist_str: str):
     """
     Parses a distribution string with format: URL|key=value|...|.extension
     Returns a dictionary suitable for the deploy API.
     """
     parts = dist_str.split('|')
     url = parts[0].strip()
     variants = {}
     format_ext = None
     compression = None

     # Iterate over the modifiers (everything after the URL)
     for part in parts[1:]:
         part = part.strip()

         # Case 1: Extension (starts with .)
         if part.startswith('.'):
             # purely heuristic: if it looks like compression (gz, zip, br), treat as compression
             # otherwise treat as format extension
-            if part.lower() in ['.gz', '.zip', '.br', '.tar', '.zst']:
+            if part.lower() in COMPRESSION_EXTENSIONS:
                 compression = part.lstrip('.')  # remove leading dot for API compatibility if needed
             else:
                 format_ext = part.lstrip('.')

         # Case 2: Content Variant (key=value)
         elif '=' in part:
             key, value = part.split('=', 1)
             variants[key.strip()] = value.strip()

         # Case 3: Standalone tag (treat as boolean variant or ignore?
         # For now, we assume it's a value for a default key or warn)
         else:
-            print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
+            print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- databusclient/api/deploy.py
- databusclient/api/queries.py
- databusclient/cli.py
- tests/test_deploy.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_deploy.py (1)
databusclient/cli.py (1)
parse_distribution_str(15-57)
🔇 Additional comments (5)
databusclient/api/queries.py (1)
9-51: LGTM: Well-structured SPARQL query with proper variant handling.
The query correctly uses GROUP_CONCAT with OPTIONAL to handle distributions that may lack content variants, and appropriately filters development versions.
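A minimal sketch of the query shape being described here (not the actual ONTOLOGIES_QUERY; the prefixes, content-variant property pattern, and dev-version filter below are assumptions for illustration):

ONTOLOGIES_QUERY_SKETCH = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dcv:  <https://dataid.dbpedia.org/databus-cv#>

SELECT ?distribution (GROUP_CONCAT(?variant; separator="|") AS ?variants)
WHERE {
  ?dataset dcat:distribution ?distribution .
  # OPTIONAL keeps distributions that have no content variants at all
  OPTIONAL {
    ?distribution ?cvProp ?cvValue .
    FILTER(STRSTARTS(STR(?cvProp), STR(dcv:)))
    BIND(CONCAT(STRAFTER(STR(?cvProp), STR(dcv:)), "=", STR(?cvValue)) AS ?variant)
  }
  # exclude development versions (assumed naming convention)
  FILTER(!CONTAINS(STR(?distribution), "dev"))
}
GROUP BY ?distribution
"""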
databusclient/api/deploy.py (3)
176-208: LGTM: Correctly extracts metadata from pre-parsed dictionaries.
The function properly handles optional fields and falls back to loading file stats when necessary. The defaults for formatExtension ("file") and compression ("none") align with the legacy inference logic.
Note: The function expects byteSize in the input dict (line 202) but returns content_length, which matches the API entity field naming convention.
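For illustration, a caller that already knows the file stats can include them in the dict so the download-based fallback is skipped entirely (key names follow this comment; the digest and size values are placeholders):

from databusclient.api.deploy import _get_file_info_from_dict

dist = {
    "url": "http://example.com/data.ttl",
    "variants": {"lang": "en"},
    "formatExtension": "ttl",
    "compression": "none",
    # With both stats present, the helper should not need to call _load_file_stats(url).
    "sha256sum": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",  # placeholder digest
    "byteSize": 1024,
}
cvs, format_extension, compression, sha256sum, content_length = _get_file_info_from_dict(dist)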
304-348: LGTM: Clear documentation of dual-format distributions.
The updated signature and docstring clearly communicate that distributions can be either legacy strings or pre-parsed dictionaries, maintaining backward compatibility.
366-387: LGTM: Clean dual-format handling with proper type discrimination.
The implementation correctly routes dict and string distributions through their respective parsers while maintaining consistent downstream processing.
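A toy sketch of that routing pattern (the stand-in helpers below are illustrative only; the real branch in deploy.py dispatches to _get_file_info_from_dict and the existing string-based parser):

def _from_dict(d):
    # Stand-in for _get_file_info_from_dict
    return d.get("variants", {}), d.get("formatExtension") or "file", d.get("compression") or "none"

def _from_str(s):
    # Stand-in for the legacy string parser (name and behavior are hypothetical)
    return {}, "file", "none"

distributions = [
    {"url": "http://example.com/a.ttl", "variants": {"lang": "en"}, "formatExtension": "ttl", "compression": "none"},
    "http://example.com/b.ttl|lang=de|.ttl",
]

for dist in distributions:
    if isinstance(dist, dict):   # pre-parsed dict from the CLI's pipe syntax
        info = _from_dict(dist)
    else:                        # legacy string format
        info = _from_str(dist)
    print(info)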
databusclient/cli.py (1)
130-138: LGTM: Clean integration of distribution parsing into the deploy workflow.
The list comprehension correctly transforms raw distribution strings into structured dictionaries before passing them to the API layer.
Actionable comments posted: 2
♻️ Duplicate comments (1)
databusclient/cli.py (1)
1-12: Past review comment about unused re import appears resolved.
The previous review mentioned an unused re import at line 4, but the current code shows from typing import List at that line. The issue appears to have been addressed.
🧹 Nitpick comments (2)
databusclient/cli.py (2)
14-14: Add return type annotation.
The function should include a return type annotation for better type safety and documentation.
🔎 Proposed fix
+from typing import List, Dict, Any
+
-def parse_distribution_str(dist_str: str):
+def parse_distribution_str(dist_str: str) -> Dict[str, Any]:
49-49: Use click.echo() instead of print() for consistency.
For CLI applications using Click, prefer click.echo() over print() for better cross-platform compatibility and consistency with the rest of the CLI output.
🔎 Proposed fix
-            print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
+            click.echo(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- databusclient/api/queries.py
- databusclient/cli.py
- tests/test_deploy.py
🚧 Files skipped from review as they are similar to previous changes (1)
- databusclient/api/queries.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/cli.py (1)
databusclient/api/deploy.py (1)
create_dataset(304-465)
🔇 Additional comments (1)
tests/test_deploy.py (1)
15-16: LGTM!
Good refactoring to define the example URL as a constant. This improves maintainability and follows DRY principles.
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/test_parse_distribution.py (4)
92-121: Consider adding a test for case-insensitive compression detection.
The implementation uses .lower() to match compression extensions (as seen in the relevant code snippet), but there's no test verifying that .GZ, .Zip, etc. are correctly detected. This ensures the case-insensitive behavior is maintained.
🔎 Suggested test addition
Add this test to the compression detection section:
def test_compression_case_insensitive(self):
    """Test that compression detection is case-insensitive."""
    result = parse_distribution_str("http://example.com/file|.GZ")
    assert result["compression"] == "gz"
    result = parse_distribution_str("http://example.com/file|.Zip")
    assert result["compression"] == "zip"
21-30: Consider adding a test for URLs containing pipe characters.
The current parsing logic splits on | to separate the URL from modifiers. If a URL contains a pipe character (though rare, it's valid in URLs), the parsing would break. Consider adding a test to document this limitation or suggest URL encoding.
🔎 Suggested test addition
Add this test to the URL extraction section:
def test_url_with_pipe_character(self):
    """Test handling of URLs containing pipe characters (edge case)."""
    # URLs can technically contain pipe characters
    # This test documents current behavior (would break parsing)
    # Consider URL encoding pipes as %7C in practice
    url_with_pipe = "http://example.com/data?filter=a%7Cb"  # %7C is URL-encoded pipe
    result = parse_distribution_str(f"{url_with_pipe}|lang=en")
    assert result["url"] == url_with_pipe
63-121: Consider using parametrized tests to reduce duplication.
The format extension tests (lines 63-87) and compression detection tests (lines 92-121) follow a repetitive pattern. Using pytest.mark.parametrize could make these more maintainable and concise.
🔎 Example parametrized test refactor
Replace the individual format extension tests with:
@pytest.mark.parametrize("extension,expected", [ ("json", "json"), ("ttl", "ttl"), ("csv", "csv"), ("xml", "xml"), ]) def test_format_extension_detection(self, extension, expected): """Test format extension detection for various types.""" result = parse_distribution_str(f"http://example.com/file|.{extension}") assert result["formatExtension"] == expectedSimilarly for compression tests:
@pytest.mark.parametrize("compression,expected", [ ("gz", "gz"), ("zip", "zip"), ("br", "br"), ("tar", "tar"), ("zst", "zst"), ]) def test_compression_detection(self, compression, expected): """Test compression detection for various types.""" result = parse_distribution_str(f"http://example.com/file|.{compression}") assert result["compression"] == expected
126-146: Consider adding a test for multiple extensions of the same type.
The current implementation would overwrite if multiple format extensions (e.g., |.json|.xml) or multiple compression types (e.g., |.gz|.zip) are specified. A test documenting this behavior would help prevent confusion.
🔎 Suggested test addition
Add to the edge cases section:
def test_multiple_format_extensions_last_wins(self):
    """Test that when multiple format extensions are specified, the last one wins."""
    result = parse_distribution_str("http://example.com/file|.json|.xml")
    # Last extension specified should be used
    assert result["formatExtension"] == "xml"

def test_multiple_compression_types_last_wins(self):
    """Test that when multiple compression types are specified, the last one wins."""
    result = parse_distribution_str("http://example.com/file|.gz|.zip")
    # Last compression specified should be used
    assert result["compression"] == "zip"
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_parse_distribution.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_parse_distribution.py (2)
databusclient/cli.py (1)
parse_distribution_str(14-56)
databusclient/api/deploy.py (2)
create_dataset(304-465)
_get_file_info_from_dict(176-208)
🔇 Additional comments (4)
tests/test_parse_distribution.py (4)
1-11: LGTM! Imports are appropriate for the test scope.
The imports correctly include the main function under test (parse_distribution_str) and the deployment API components needed for integration testing. Importing the private function _get_file_info_from_dict is acceptable for integration testing purposes.
151-176: LGTM! Edge cases are well-tested.
The tests properly verify whitespace handling, standalone tag warnings (including both stdout capture and result validation), and the minimal URL-only case. The test at lines 158-166 correctly ensures that unrecognized modifiers trigger a warning and are not added to the variants dict.
181-208: LGTM! Integration tests correctly verify parsed dictionaries work with the deploy API.
The mocking strategy is appropriate, stubbing _load_file_stats to avoid actual file operations (a sketch of this setup follows the list). The test correctly verifies that:
- Parsed variants, extensions, and compression are properly extracted
- Default values ("file" for extension, "none" for compression) are applied when not specified
- The SHA-256 hash length is correct (64 characters)
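A minimal sketch of that mocking setup (the patch target _load_file_stats exists per this review; the fixture values and assertions are illustrative, not copied from the test file):

from unittest.mock import patch

from databusclient.cli import parse_distribution_str
from databusclient.api.deploy import _get_file_info_from_dict

def test_parsed_dict_feeds_deploy_helper():
    dist = parse_distribution_str("http://example.com/data|lang=en|.ttl")
    # Stub out file/network access; return a fake (sha256, byte size) pair.
    with patch("databusclient.api.deploy._load_file_stats", return_value=("a" * 64, 123)):
        cvs, fmt, comp, sha, size = _get_file_info_from_dict(dist)
    assert cvs == {"lang": "en"}
    assert fmt == "ttl"
    assert comp == "none"   # default when no compression modifier is given
    assert len(sha) == 64   # SHA-256 hex digest length
    assert size == 123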
209-275: LGTM! Comprehensive integration tests verify end-to-end functionality.
These tests effectively verify that:
- Parsed distribution dictionaries are correctly consumed by create_dataset
- The generated dataset structure includes proper @context and @graph
- Distribution fields are correctly mapped, including content variants prefixed with dcv:
- Multiple distributions with different language variants are properly represented
The approach of using generator expressions with next() (lines 234-237, 266-269) to locate specific graphs is clean and idiomatic.
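For reference, that lookup idiom looks roughly like this (the toy dataset, the @context URL, and the @type values below are assumptions; in the tests the JSON-LD comes from create_dataset):

# Locate the first node of a given type inside a JSON-LD @graph list.
dataset = {
    "@context": "https://example.org/context.jsonld",  # placeholder context
    "@graph": [
        {"@type": "Version", "title": "Example"},
        {"@type": "Part", "formatExtension": "ttl", "compression": "none"},
    ],
}
part_graph = next(
    (g for g in dataset["@graph"] if g.get("@type") == "Part"),
    None,  # default avoids StopIteration when nothing matches
)
assert part_graph is not None and part_graph["formatExtension"] == "ttl"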
@Integer-Ctrl Could you please review my PR?
Integer-Ctrl
left a comment
Just some general feedback. I commented on #32 to gain more insight and clarity on the issue/goal.
File seems useless? Is ONTOLOGIES_QUERY or parse_content_variants_string used at all?
Thanks for the feedback. The usage is currently indirect, which I agree is not very clear. I’ll add clarification (or refactor) to make the purpose and usage of ONTOLOGIES_QUERY / parse_content_variants_string more explicit.
I would prefer to keep cli.py as compact as possible and always move logic (methods) to the corresponding CLI option (in this case, deploy.py).
I would prefer to keep cli.py as compact as possible and always move logic (methods) to the corresponding CLI option (in this case, deploy.py).
Sure @Integer-Ctrl, I will incorporate the suggested changes.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@databusclient/api/deploy.py`:
- Around line 218-249: The function _get_file_info_from_dict currently calls
_load_file_stats(url) even when url is empty; add an explicit validation in
_get_file_info_from_dict to fail fast: if either sha256sum or content_length is
missing and the local variable url is falsy/empty, raise a clear exception
(e.g., ValueError) indicating the missing required "url" instead of calling
_load_file_stats; keep the existing fallback to _load_file_stats only when url
is present. Ensure references to url, sha256sum, content_length,
_get_file_info_from_dict and _load_file_stats are used so reviewers can locate
the change.
In `@databusclient/cli.py`:
- Around line 133-147: The create_dataset call in cli.py has a syntax error
(missing comma) and passes the wrong variable: remove the duplicate positional
args (first line with version_id, title, abstract, description, license_url,
parsed_distributions) or add the missing comma and replace the final
distributions argument with parsed_distributions so api_deploy.create_dataset
receives the structured list; update the call that references
api_deploy.create_dataset (and ensure you constructed parsed_distributions via
parse_distribution_str) to use keyword args version_id / artifact_version_title
/ artifact_version_abstract / artifact_version_description / license_url and
distributions=parsed_distributions.
def _get_file_info_from_dict(dist_dict: Dict[str, any]) -> Tuple[Dict[str, str], str, str, str, int]:
    """
    Extract file info from a pre-parsed distribution dictionary.

    Parameters
    ----------
    dist_dict : dict
        A dictionary with keys: url, variants, formatExtension, compression
        (as returned by parse_distribution_str in cli.py)

    Returns
    -------
    Tuple containing:
    - cvs: Dict of content variants
    - format_extension: File format extension
    - compression: Compression type
    - sha256sum: SHA-256 hash of file
    - content_length: File size in bytes
    """
    url = dist_dict.get("url", "")
    cvs = dist_dict.get("variants", {})
    format_extension = dist_dict.get("formatExtension") or "file"
    compression = dist_dict.get("compression") or "none"

    # Check if sha256sum and content_length are provided
    sha256sum = dist_dict.get("sha256sum")
    content_length = dist_dict.get("byteSize")

    # If not provided, load from URL
    if sha256sum is None or content_length is None:
        sha256sum, content_length = _load_file_stats(url)
Validate required url before fallback download.
If a caller passes a dict without url, _load_file_stats("") raises a low-signal requests error. Fail fast with a clear exception before any network call.
🛠️ Proposed fix
 def _get_file_info_from_dict(dist_dict: Dict[str, any]) -> Tuple[Dict[str, str], str, str, str, int]:
     """
 @@
-    url = dist_dict.get("url", "")
+    url = dist_dict.get("url")
+    if not url:
+        raise BadArgumentException("Distribution dict missing required 'url' field.")
🤖 Prompt for AI Agents
In `@databusclient/api/deploy.py` around lines 218 - 249, The function
_get_file_info_from_dict currently calls _load_file_stats(url) even when url is
empty; add an explicit validation in _get_file_info_from_dict to fail fast: if
either sha256sum or content_length is missing and the local variable url is
falsy/empty, raise a clear exception (e.g., ValueError) indicating the missing
required "url" instead of calling _load_file_stats; keep the existing fallback
to _load_file_stats only when url is present. Ensure references to url,
sha256sum, content_length, _get_file_info_from_dict and _load_file_stats are
used so reviewers can locate the change.
    # --- CHANGE START ---
    # Parse the input strings into structured objects
    parsed_distributions = [parse_distribution_str(d) for d in distributions]

    # Note: api_deploy.create_dataset now accepts this list of dicts
    dataid = api_deploy.create_dataset(
        version_id, title, abstract, description, license_url, parsed_distributions
        version_id=version_id,
        artifact_version_title=title,
        artifact_version_abstract=abstract,
        artifact_version_description=description,
        license_url=license_url,
        distributions=distributions
    )
    # --- CHANGE END ---
🧩 Analysis chain
🏁 Script executed:
# Check the actual file content at the specified lines
cat -n databusclient/cli.py | sed -n '130,150p'
Repository: dbpedia/databus-python-client
Length of output: 1099
🏁 Script executed:
# Also search for the create_dataset function signature to understand expected parameters
rg -n "def create_dataset" --type py -A 10
Repository: dbpedia/databus-python-client
Length of output: 764
🏁 Script executed:
# Find where api_deploy is defined/imported to understand the module
rg -n "api_deploy" databusclient/cli.py | head -20
Repository: dbpedia/databus-python-client
Length of output: 396
Fix syntax error in create_dataset call and use correct variable.
Line 139 is missing a comma before the keyword arguments on line 140, and you're passing distributions instead of parsed_distributions on line 145. This won't parse and breaks the CLI.
Fix
-    dataid = api_deploy.create_dataset(
-        version_id, title, abstract, description, license_url, parsed_distributions
-        version_id=version_id,
-        artifact_version_title=title,
-        artifact_version_abstract=abstract,
-        artifact_version_description=description,
-        license_url=license_url,
-        distributions=distributions
-    )
+    dataid = api_deploy.create_dataset(
+        version_id=version_id,
+        artifact_version_title=title,
+        artifact_version_abstract=abstract,
+        artifact_version_description=description,
+        license_url=license_url,
+        distributions=parsed_distributions,
+    )
🧰 Tools
🪛 GitHub Actions: Python CI (Lint & pytest)
[error] 140-140: Ruff check failed with SyntaxError: Expected ',', found name at databusclient/cli.py:140:13. Command: 'poetry run ruff check --output-format=github .'
🪛 GitHub Check: build
[failure] 140-140: Ruff
databusclient/cli.py:140:13: SyntaxError: Expected ',', found name
🤖 Prompt for AI Agents
In `@databusclient/cli.py` around lines 133 - 147, The create_dataset call in
cli.py has a syntax error (missing comma) and passes the wrong variable: remove
the duplicate positional args (first line with version_id, title, abstract,
description, license_url, parsed_distributions) or add the missing comma and
replace the final distributions argument with parsed_distributions so
api_deploy.create_dataset receives the structured list; update the call that
references api_deploy.create_dataset (and ensure you constructed
parsed_distributions via parse_distribution_str) to use keyword args version_id
/ artifact_version_title / artifact_version_abstract /
artifact_version_description / license_url and
distributions=parsed_distributions.
This PR updates the CLI to correctly handle custom file extensions and multiple content variants. Previously, there was no syntax to submit these specifics via the CLI, forcing users to rely on complex SPARQL queries to exclude unwanted variants. I have implemented a new pipe-separated syntax (e.g., URL|key=value|.ext) that allows users to pass metadata directly with the artifact URL. This implementation maintains backward compatibility with the existing underscore-separated format (key1=value1_key2=value2) while adding support for the new, more flexible pipe syntax.
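For illustration, the same flow driven from Python (the keyword argument names match the fix shown above; the example.com URLs, version_id IRI, and metadata strings are placeholders):

from databusclient.cli import parse_distribution_str
from databusclient.api import deploy as api_deploy

dists = [
    "http://example.com/data_en.ttl|lang=en|.ttl",
    "http://example.com/data_de.ttl.gz|lang=de|.ttl|.gz",
]
parsed_distributions = [parse_distribution_str(d) for d in dists]

dataid = api_deploy.create_dataset(
    version_id="https://databus.dbpedia.org/user/group/artifact/2024.01.01",  # placeholder version IRI
    artifact_version_title="Example dataset",
    artifact_version_abstract="Short abstract.",
    artifact_version_description="Longer description.",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    distributions=parsed_distributions,
)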
issue #32